Using methylation data to improve transcription factor binding prediction

ABSTRACT Modelling the regulatory mechanisms that determine cell fate, response to external perturbation, and disease state depends on measuring many factors, a task made more difficult by the plasticity of the epigenome. Scanning the genome for the sequence patterns defined by Position Weight Matrices (PWMs) can be used to estimate transcription factor (TF) binding locations. However, this approach does not incorporate information regarding the epigenetic context necessary for TF binding. CpG methylation is an epigenetic mark influenced by environmental factors that is commonly assayed in human cohort studies. We developed a framework to score inferred TF binding locations using methylation data. We intersected motif locations identified using PWMs with methylation information captured in both whole-genome bisulfite sequencing (WGBS) and Illumina EPIC array data for six cell lines, scored motif locations based on these data, and compared the scores with experimental data characterizing TF binding (ChIP-seq). We found that for most TFs, binding prediction improves using methylation-based scoring compared to standard PWM-scores. We also illustrate that our approach can be generalized to infer TF binding when methylation information is only proximally available, i.e., measured for nearby CpGs that do not directly overlap with a motif location. Overall, our approach provides a framework for inferring context-specific TF binding using methylation data. Importantly, the availability of DNA methylation data in existing patient populations provides an opportunity to use our approach to understand the impact of methylation on gene regulatory processes in the context of human disease.
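
The workflow summarized above (intersect motif locations with assayed CpGs, score each motif location, and evaluate against ChIP-seq) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not define the Methyl-score, so the formula used here (1 minus the mean β of CpGs inside the motif) is an assumption, as is the premise that lower methylation favors binding, which need not hold for every TF. All function names and the toy coordinates are hypothetical.

```python
from statistics import mean

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end

def methyl_score(motif, cpgs):
    """ASSUMED score: 1 - mean beta of CpGs inside the motif; None if no CpG overlaps."""
    chrom, start, end = motif
    betas = [beta for c, pos, beta in cpgs if c == chrom and start <= pos < end]
    return 1.0 - mean(betas) if betas else None

def auroc(scores, labels):
    """AUROC as the probability that a bound site outscores an unbound one (ties count 0.5)."""
    pos = [s for s, bound in zip(scores, labels) if bound]
    neg = [s for s, bound in zip(scores, labels) if not bound]
    if not pos or not neg:
        raise ValueError("need at least one bound and one unbound motif location")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration only.
motifs = [("chr1", 100, 110), ("chr1", 200, 210), ("chr1", 300, 310), ("chr1", 400, 410)]
cpgs = [("chr1", 103, 0.05), ("chr1", 205, 0.90), ("chr1", 206, 0.80), ("chr1", 302, 0.20)]
chip_peaks = [("chr1", 90, 150), ("chr1", 290, 350)]

scored = []
for motif in motifs:
    score = methyl_score(motif, cpgs)
    if score is None:
        continue  # motif location has no assayed CpG, so it cannot be scored this way
    bound = any(c == motif[0] and overlaps(motif[1], motif[2], s, e)
                for c, s, e in chip_peaks)
    scored.append((motif, score, bound))

toy_auroc = auroc([s for _, s, _ in scored], [b for _, _, b in scored])
print(f"scored motif locations: {len(scored)}, AUROC: {toy_auroc:.2f}")
```

The pairwise AUROC used here is quadratic in the number of motif locations, which is fine for a toy example; a rank-based computation would be the more practical choice at genome scale.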


Figure S1. Overlap between ChIP-seq and motif locations. The percentage of ChIP-seq locations that overlap with one or more motif locations (left) and the percentage of motif locations that overlap with one or more ChIP-seq locations (right). For each cell line, the distribution across the assessed TFs is shown.

Figure S2. Comparing PWM-based predictions between motif locations with and without a CpG. (A) A comparison of the PWM-score's ability to predict TF binding (ChIP-seq) when limited to motif locations that contain, or do not contain, a CpG. A t-test comparing performance within each of the cell lines finds no significant difference. P-values: A549, 0.233128; GM12878, 0.566850; HeLa, 0.235387; HepG2, 0.336910; K562, 0.127878; SKNSH, 0.365327. (B) Bar chart showing the percentage of motif locations containing a CpG before intersecting with the methylation data.

Figure S3. WGBS sequence depth analysis for a representative example: CEBPG in K562. (A) Scatter plots comparing WGBS to array values when restricting to assayed methylation sites that overlap with CEBPG motif locations and that also have a WGBS read depth greater than 0, 10, or 20. (B) Histograms of WGBS values for these same methylation sites. (C) Density curves (fitted kernel density estimates) comparing various methods for aggregating methylation β values, grouped by the number of methylation sites that overlap with the predicted TF motif locations: 2, 3, 4, 5, or 6+ CpGs per motif location. We note that the patterns in these plots are similar for other transcription factors and cell lines.

Figure S4. Distribution of the WGBS and array-based Methyl-scores across all assessed motif locations, separated into motif locations that overlap with a corresponding ChIP-seq peak (red line) and those that do not (black line).

Figure S5. Details supporting main analysis. (A) Same plot as shown in Figure 2A. Each point is a TF-cell line combination. (B) Table showing the t-test statistic and p-value when comparing the AUROC distributions for WGBS and methyl array to the AUROC distribution for the PWM-score. (C) Heatmap showing individual cell-line-specific TF performance. Note that the rows are ordered based on the percentage of total motif locations that contain a CpG (see Figure S2B), indicating these results do not depend on the overall level of overlap with CpGs in the genome. (D) AUROC performance for each TF (averaged across cell lines) when scoring motif locations based on WGBS and methyl array. We observe similar performance between the technologies, and related transcription factors often have similar levels of performance. Groups of related TFs are highlighted. Underlines indicate TFs classified by Yin et al. as MethylPlus (red line) or MethylMinus (blue line). (E) Distribution of the difference in AUROC performance for each TF (averaged across cell lines) comparing WGBS minus PWM-based scoring (blue) and Methyl Array minus PWM-based scoring (orange). (F) Distribution of the difference in AUROC comparing WGBS and Methyl Array, showing that a Methyl-score derived from the methylation array data generally performs better than one derived from WGBS data.

Figure S6. Detailed analysis based on genomic region. (A) Same plot as shown in Figure 2B. (B) Statistical analyses comparing the distribution of AUROC scores for each pair of regions when scoring motif locations using PWM, WGBS, or methylation array. Shades of red indicate significant differences based on a t-test and associated p-value. (C) Cell-line-specific versions of panel (A). (D) Heatmap of specific TF performance averaged across cell lines per genomic region annotation. Blue indicates an AUC > 0.5 while red indicates an AUC < 0.5. (E) Differential AUROC between methylation array and WGBS, indicating that scoring based on methylation array generally performs better than scoring based on the WGBS data.

Figure S7. Detailed analysis based on TSS annotations. (A) Same plot as shown in Figure 2C. (B) Statistical analyses comparing the distribution of AUROC scores for each pair of TSS annotations when scoring motif locations using PWM, WGBS, or methylation array. Shades of red indicate significant differences based on a t-test and associated p-value. (C) Cell-line-specific versions of panel (A). (D) Heatmap of specific TF performance averaged across cell lines per TSS annotation. Blue indicates an AUC > 0.5 while red indicates an AUC < 0.5. (E) Differential AUROC between methylation array and WGBS, indicating that scoring based on methylation array generally performs better than scoring based on WGBS data. (F) Transcription factors whose context- and cell-line-specific binding is better predicted using the PWM than using methylation data and whose methylation-based predictive performance has an AUROC < 0.5. Many of these are CEBP family members.

Figure S8. Distribution of AUROC scores when scoring motif locations using CpGs annotated to (1) both CpG Islands (Islands/Shores/Shelves) and gene promoters (TSS200 or TSS1500), (2) CpG Islands but not gene promoters, and (3) gene promoters but not CpG Islands.

Figure S9. Average difference in the β values for all pairs of CpGs within a given range. For consistency with the main text, only CpGs that were assayed both by methylation array and in WGBS with a read depth of at least 10 were used.